This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2020.3003843, IEEE  
Transactions on Computer-Aided Design of Integrated Circuits and Systems  
DREAMPlace: Deep Learning Toolkit-Enabled GPU  
Acceleration for Modern VLSI Placement  
Yibo Lin, Member, IEEE, Zixuan Jiang, Jiaqi Gu, Wuxi Li, Member, IEEE, Shounak Dhar,
Haoxing Ren, Brucek Khailany, David Z. Pan, Fellow, IEEE
Abstract—Placement for very-large-scale integrated (VLSI) circuits is one of the most important steps for design closure. We propose a novel GPU-accelerated placement framework, DREAMPlace, by casting the analytical placement problem as the training of a neural network. Implemented on top of the widely adopted deep learning toolkit PyTorch, with customized key kernels for wirelength and density computations, DREAMPlace achieves around 40X speedup in global placement without quality degradation compared to the state-of-the-art multi-threaded placer RePlAce. We believe this work will open up new directions for revisiting classical EDA problems with advancements in AI hardware and software.
I. INTRODUCTION

Placement is a critical but time-consuming step in the VLSI design flow. As it determines the locations of standard cells in the physical layout, its quality has significant impacts on the later stages in the flow, such as routing and post-layout optimization. A placement solution also provides relatively accurate estimates of routed wirelength and congestion, which is very valuable in guiding earlier stages like logic synthesis. Commercial design flows often run core placement engines many times to achieve design closure. As placement involves large-scale numerical optimization, today's placers usually take hours for large designs, thus slowing down design iterations. Therefore, ultra-fast yet high-quality placement is always desired.

Analytical placement is the current state of the art for VLSI placement [1]-[15]. It essentially solves a nonlinear optimization problem. Although analytical placement can produce high-quality solutions, it is also known to be relatively slow [11], [13], [14], [16]. Here we provide a brief introduction to the analytical placement problem. Suppose a circuit is described as a hypergraph H = (V, E), where V denotes the set of vertices (cells) and E denotes the set of hyperedges (nets). Let x, y denote the locations of cells. The objective of analytical placement is to determine the locations of cells such that wirelength is minimized and there is no overlap in the layout.

Analytical placement can be roughly categorized into quadratic placement and nonlinear placement. Quadratic placement tackles the problem by iterating between an unconstrained wirelength minimization step and a rough legalization or spreading step [10]-[15]. The wirelength minimization step usually adopts a quadratic wirelength model and minimizes the total wirelength regardless of the overlaps between cells. The rough legalization step removes the overlaps with heuristic approaches, without explicit consideration of the wirelength cost. By iterating between these two steps, cells are gradually spread out while the wirelength cost is minimized. Nonlinear placement directly solves the placement problem with nonlinear optimization techniques [1]-[9], [17]. It formulates a nonlinear optimization problem with a wirelength objective subject to a density constraint. By relaxing the density constraint into the objective, gradient-descent techniques can be adopted to search for a high-quality solution. In this paper, we focus on the nonlinear placement approach, as many commercial tools, such as Cadence Innovus [18] and Synopsys IC Compiler [19], adopt it.

To accelerate placement, existing parallelization efforts have mostly targeted multi-threaded CPUs using partitioning [16], [20], [21]. As the number of threads increases, speedup quickly saturates at around 5X in global placement, with typical quality degradation of 2-6%. Cong et al. explored GPU acceleration for analytical placement [22]. They combined clustering and declustering with nonlinear placement optimization. By parallelizing the nonlinear placement part, an average of 15X speedup in global placement was reported with less than 1% quality degradation. Lin et al. proposed GPU acceleration techniques for wirelength gradient computation and area accumulation [23], but their experiments did not consider real operations such as density cost computation, and they lacked validation from real analytical placement flows. In addition, current research on placement faces the challenges of a lack of well-maintained public frameworks and high development overhead, which raise the bar for validating new algorithms systematically.

In this work, we propose DREAMPlace, a GPU-accelerated analytical placer developed with the deep learning toolkit PyTorch [24] by casting the analytical placement problem as the training of a neural network. DREAMPlace is based on the state-of-the-art analytical placement algorithms of the ePlace/RePlAce family [6], [8], but the framework is designed in a generic way that is compatible with other analytical placers such as NTUplace [4]. The key contributions are summarized as follows.
• We take a totally new perspective of making an analogy between placement and deep learning, and build an open-source, generic analytical placement framework, developed with modern deep learning toolkits, that runs on both CPU and GPU platforms.
• A variety of gradient-descent solvers are provided, such as Nesterov's method, the conjugate gradient method, and Adam [25], with the help of the deep learning toolkit.
• We propose efficient GPU implementations of key kernels in analytical placement, such as wirelength and density computation.
• We demonstrate around 40X speedup in global placement without quality degradation of the entire placement flow over multi-threaded RePlAce implementations. More specifically, a design with one million cells finishes in one minute even with legalization. The framework maintains nearly linear scalability on industrial designs with up to 10 million cells.
Y. Lin is with the Center for Energy-Efficient Computing and Applications, School of EECS at Peking University, Beijing, China. Email: yibolin@pku.edu.cn
Z. Jiang, J. Gu, and D. Z. Pan are with the Department of Electrical and Computer Engineering, The University of Texas at Austin, TX, USA.
W. Li is with Xilinx Inc., CA, USA.
S. Dhar is with Intel Corp., CA, USA.
H. Ren and B. Khailany are with NVIDIA, Austin, TX, USA.
The source code is released on GitHub. To clarify, the casting of the placement problem to a deep learning problem aims at using the toolkit to solve placement, which is orthogonal to using deep learning models for placement. The rest of the paper is organized
0278-0070 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1: Analogy between neural network training and analytical placement. (a) Train a network for weights w: \min_w \sum_i f(\phi(x_i; w), y_i) + R(w). (b) Solve a placement for cell locations w = (x, y): \min_w \sum_i \mathrm{WL}(e_i; w) + D(w).

Fig. 2: (a) Software architecture for placement implementation using deep learning toolkits. (b) DREAMPlace flow.
as follows. Section II describes the background and motivation; Section III explains the detailed implementation; Section IV demonstrates the results; Section V concludes the paper.

II. PRELIMINARIES

This section reviews the background and motivation.

A. Analytical Placement

Analytical placement usually consists of three steps: global placement (GP), legalization (LG), and detailed placement (DP). Global placement spreads cells out in the layout with a target cost minimized; legalization removes the remaining overlaps between cells and aligns cells to placement sites; detailed placement performs incremental refinement to further improve the quality. Usually, global placement is the most time-consuming portion of analytical placement.

Global placement aims at minimizing the wirelength cost subject to density constraints. The formulation can be written as follows,

\min_{x,y} \sum_{e \in E} \mathrm{WL}(e; x, y),   (1a)
\text{s.t. } d(x, y) \le d_t,   (1b)

where WL(·; ·) is the wirelength cost function that takes a net instance e and returns its wirelength, d(·) is the density of a location in the layout, and d_t is a given target density. A typical solving approach is to relax the density constraints into the objective as a density penalty [1], [4], [6],

\min_{x,y} \left( \sum_{e \in E} \mathrm{WL}(e; x, y) \right) + \lambda D(x, y),   (2)

where D(·) is the density penalty that spreads cells out in the layout. The density constraints can be satisfied by gradually increasing the weight \lambda.

B. Analogy to Deep Learning

As both solving an analytical placement and training a neural network essentially solve a nonlinear optimization problem, we investigate the underlying similarity between the two problems: the analogy of the wirelength cost to the error of misprediction, and that of the density cost to the regularization term. Figure 1 shows the objective functions of the two problems. In neural network training, each data instance with a feature vector x_i and a label y_i is fed to the network, and the neural network predicts a label \phi(x_i; w). The task of training is to minimize the overall objective over the weights w, where the objective consists of the prediction errors for all data instances and a regularization term R(w) [26]. In the analogy of placement to neural network training, we combine the cell locations (x, y) into w for brevity. Each data instance is replaced with a net instance with a feature vector e_i and a label zero. The neural network then takes a net instance and computes the wirelength cost WL(e_i; w). Using the absolute error function f(\hat{y}, y) = |\hat{y} - y| and noting that wirelength is non-negative, the minimization of prediction errors becomes \sum_{i}^{n} \mathrm{WL}(e_i; w). The density cost D(w) corresponds to the regularization term R(w), as it is not related to net instances. With this construction, we find a one-to-one mapping of each component in analytical placement to neural network training, which makes it possible to take advantage of recent developments in deep learning toolkits for implementation. Then, we can solve the placement problem following the neural network training procedure, with forward propagation to compute the objective and backward propagation to calculate the gradient.

Deep learning toolkits nowadays consist of three stacks: low-level operators (OPs), automatic gradient derivation, and optimization engines, as shown in Figure 2a. Toolkits like TensorFlow and PyTorch offer mature and efficient implementations of these three stacks, compatible with both CPU and GPU acceleration. The toolkits also provide convenient APIs to extend the existing set of low-level operators. Each custom operator requires well-defined forward and backward functions for cost and gradient computation. To develop an analytical placer with deep learning toolkits, we only need to implement the custom operators for the wirelength and density costs in C++ and CUDA. Then we can construct a placement framework in Python with very low development overhead and easily incorporate a variety of optimization engines from the toolkit. The placement framework can run on both CPU and GPU platforms. Conventional development of placement engines takes huge effort in building the entire software stack in C++; thus, the bar for designing and validating a new placement algorithm is very high due to the development overhead. Taking advantage of deep learning toolkits, researchers can concentrate on the development of the critical parts, like low-level operators and high-level optimization engines.

C. The ePlace/RePlAce Algorithm

ePlace/RePlAce is a state-of-the-art family of global placement algorithms that model the layout and netlist as an electrostatic system [6]-[8]. It uses the weighted-average (WA) wirelength originally proposed by [27], [28] as the wirelength cost,

\mathrm{WA}_e = \frac{\sum_{i \in e} x_i e^{x_i/\gamma}}{\sum_{i \in e} e^{x_i/\gamma}} - \frac{\sum_{i \in e} x_i e^{-x_i/\gamma}}{\sum_{i \in e} e^{-x_i/\gamma}},   (3)

where \gamma is a parameter to control the smoothness and accuracy of the approximation to the half-perimeter wirelength (HPWL). The smaller \gamma is, the more accurately WA approximates HPWL, but the less smooth it is.
TABLE I: Notations
V: the set of cells; E: the set of nets; P: the set of pins; B: the set of bins;
x_e^- = \min_{i \in e} x_i, x_e^+ = \max_{i \in e} x_i, e \in E;
a_i^{\pm} = e^{\pm (x_i - x_e^{\pm})/\gamma}, i \in e, e \in E;
b_e^{\pm} = \sum_{i \in e} a_i^{\pm}, e \in E;
c_e^{\pm} = \sum_{i \in e} x_i a_i^{\pm}, e \in E;
x^{\pm} = \{x_e^{\pm}\}, e \in E; a^{\pm} = \{a_i^{\pm}\}, i \in P; b^{\pm} = \{b_e^{\pm}\}, e \in E; c^{\pm} = \{c_e^{\pm}\}, e \in E.

Fig. 3: RePlAce [8] runtime breakdown in percentages on bigblue4 (2 million cells). (a) 1 thread; (b) 10 threads.
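To illustrate the Table I notation, the sketch below (plain Python over a made-up two-net netlist; the real kernels are C++/CUDA) accumulates b± and c± from the per-pin terms a±, which are exactly the per-net sums the wirelength kernels maintain, and recovers each net's WA wirelength as c+/b+ - c-/b-:

```python
import math

GAMMA = 0.5
pin_x = [0.0, 1.0, 2.5, 1.0, 4.0]   # pin x-coordinates
pin2net = [0, 0, 0, 1, 1]           # the net that each pin belongs to
num_nets = 2

# x_e^+ and x_e^- per net
xp = [max(x for x, e in zip(pin_x, pin2net) if e == n) for n in range(num_nets)]
xn = [min(x for x, e in zip(pin_x, pin2net) if e == n) for n in range(num_nets)]

bp = [0.0] * num_nets; bn = [0.0] * num_nets
cp = [0.0] * num_nets; cn = [0.0] * num_nets
for t, x in enumerate(pin_x):            # one contribution per pin
    e = pin2net[t]
    ap = math.exp((x - xp[e]) / GAMMA)   # a_t^+, overflow-safe shifted form
    an = math.exp((xn[e] - x) / GAMMA)   # a_t^-
    bp[e] += ap; bn[e] += an
    cp[e] += x * ap; cn[e] += x * an

wl = [cp[e] / bp[e] - cn[e] / bn[e] for e in range(num_nets)]
# Each WL_e slightly underestimates its net's HPWL (2.5 for net 0,
# 3.0 for net 1) and tends toward it as GAMMA shrinks.
```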
Its density penalty is quite different from other analytical placers [1], [3], [4]. With the analogy to an electrostatic system, cells are modeled as charges, the density penalty is modeled as potential energy, and the density gradient is modeled as the electric field. The electric potential and field distribution can be computed by solving Poisson's equation from the charge density distribution,

\nabla \cdot \nabla \psi(x, y) = -\rho(x, y),   (4a)
\hat{n} \cdot \nabla \psi(x, y) = 0, \quad (x, y) \in \partial R,   (4b)
\iint_R \rho(x, y) = \iint_R \psi(x, y) = 0,   (4c)

where R denotes the placement region, \partial R denotes the boundary of the region, \hat{n} denotes the outer normal vector of the placement region, \rho denotes the charge density, and \psi denotes the electric potential.

The numerical solution of Poisson's equation can be obtained with spectral methods. Given an M x M grid of bins and w_u = 2\pi u / M and w_v = 2\pi v / M with u = 0, 1, ..., M - 1, v = 0, 1, ..., M - 1, the solution can be computed as follows [6],

a_{u,v} = \frac{1}{M^2} \sum_{x=0}^{M-1} \sum_{y=0}^{M-1} \rho(x, y) \cos(w_u x) \cos(w_v y),   (5a)
\psi_{DCT}(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} \frac{a_{u,v}}{w_u^2 + w_v^2} \cos(w_u x) \cos(w_v y),   (5b)
\xi^X_{DSCT}(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} \frac{a_{u,v} w_u}{w_u^2 + w_v^2} \sin(w_u x) \cos(w_v y),   (5c)
\xi^Y_{DCST}(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} \frac{a_{u,v} w_v}{w_u^2 + w_v^2} \cos(w_u x) \sin(w_v y),   (5d)

where \psi_{DCT} denotes the numerical solution of the potential function, and \xi^X_{DSCT} and \xi^Y_{DCST} denote the solutions of the electric field in the horizontal and vertical directions, respectively. Equation (5) requires Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) routines to solve efficiently. The detailed computation is explained in Section III. With the electric field defined for each bin, the density gradient of each cell is the overall force exerted on the cell in the system.

After defining the wirelength cost and density penalty, RePlAce adopts gradient-descent optimizers, such as Nesterov's method and the conjugate gradient method, to solve the optimization problem. RePlAce was implemented with multi-threading support [8]. The runtime breakdown for RePlAce [8] is shown in Figure 3. GP, including initial placement (GP-IP) and nonlinear optimization (GP-Nonlinear), takes about 90% of the runtime with both a single thread and 10 threads. Therefore, accelerating GP is the most effective way to reduce the overall runtime.

III. THE DREAMPLACE ALGORITHMS

Our overall placement flow is given in Figure 2b. It is slightly different from the typical one, which starts from a bound-to-bound initial placement [4], [6]. We observe that starting from a random initial placement also achieves the same quality (< 0.04% difference) with significantly less runtime (21.1% in Figure 3). In initial placement, standard cells are placed at the center of the layout with a small Gaussian noise. In our experiments, the scales of the noise are set to 0.1% of the width and height of the placement region. The kernel global placement iterations refer to the loop that involves the computation of wirelength and density gradients, the optimization engine, and cell location updating. After the global placement converges, legalization is performed to remove remaining overlaps and align cells to placement sites. The last step before output is detailed placement, which refines the placement solution relying on NTUplace3 [4]. The rest of this section focuses on GPU acceleration of the ePlace/RePlAce algorithm [6], [8].

A. Wirelength Forward and Backward

As RePlAce adopts the WA wirelength, we also use it as the example for GPU acceleration of the wirelength forward and backward computations. Similar insights also apply to other wirelength costs like log-sum-exp (LSE) [29], which is also implemented in the framework. For brevity, we only discuss the equations in the x dimension, as those in the y dimension are similar. The real implementation separates the computations for x and y into different GPU streams, as they are independent.

Direct implementation of the WA wirelength defined in Equation (3) may result in numerical overflow, so we convert e^{x_i/\gamma} to e^{(x_i - x_{\max})/\gamma} and e^{-x_i/\gamma} to e^{(x_{\min} - x_i)/\gamma} in Equation (3), which is an equivalent transformation. With the notations in Table I, the gradient of the WA wirelength with respect to a pin location can be written as,

\frac{\partial \mathrm{WL}_e}{\partial x_i} = \frac{(1 + \frac{x_i}{\gamma}) b_e^+ - \frac{1}{\gamma} c_e^+}{(b_e^+)^2} \cdot a_i^+ - \frac{(1 - \frac{x_i}{\gamma}) b_e^- + \frac{1}{\gamma} c_e^-}{(b_e^-)^2} \cdot a_i^-.   (6)

A naive parallelization scheme is to allocate one thread for each net. This scheme has also been discussed in [23], which only demonstrated limited speedup, because the maximum number of threads to allocate is |E|, and the workload for each thread is imbalanced due to the heterogeneity of net degrees.

Noting that the total number of pins |P| is much larger than |E|, we consider the possibility of pin-level parallelization. The dependency graph for the WA wirelength forward and backward is shown in Figure 4a. A straightforward implementation of this pin-level parallelism is to compute a^{\pm}, b^{\pm}, c^{\pm} in separate CUDA kernels using multiple CUDA streams. The computation can be completed in four steps: 1) compute x^{\pm}; 2) compute and store a^{\pm}; 3) compute and store b^{\pm}, c^{\pm}; 4) compute WL_e in forward or \partial WL_e / \partial x_i in backward. Algorithm 1 illustrates this multi-stream version of the pin-level parallel implementation of the WA wirelength forward and backward functions. For brevity, we make all the CUDA kernel functions inline, although they should be separate in practice. Specifically, computations for an array with different \pm signs, e.g., x^+ and x^-, are separated into different CUDA streams in the implementation. In
Algorithm 1 Wirelength Forward and Backward Atomic [30]
Require: A set of nets E, a set of pins P, and pin locations x;
Ensure: Wirelength cost and gradient;
 1: function Forward(E, P, x)
 2:   x^+ <- -inf, x^- <- inf, b^{\pm} <- 0, c^{\pm} <- 0;
 3:   for each thread 0 <= t < |P| do   ▷ x^{\pm} kernel
 4:     Define e as the net that pin t belongs to;
 5:     x_e^+ <-at. max(x_e^+, x_t);   ▷ atomic max
 6:     x_e^- <-at. min(x_e^-, x_t);   ▷ atomic min
 7:   end for
 8:   for each thread 0 <= t < |P| do   ▷ a^{\pm} kernel
 9:     Define e as the net that pin t belongs to;
10:     a_t^{\pm} <- e^{\pm (x_t - x_e^{\pm})/\gamma};
11:   end for
12:   for each thread 0 <= t < |P| do   ▷ b^{\pm} kernel
13:     Define e as the net that pin t belongs to;
14:     b_e^{\pm} <-at. b_e^{\pm} + a_t^{\pm};   ▷ atomic add
15:   end for
16:   for each thread 0 <= t < |P| do   ▷ c^{\pm} kernel
17:     Define e as the net that pin t belongs to;
18:     c_e^{\pm} <-at. c_e^{\pm} + x_t a_t^{\pm};   ▷ atomic add
19:   end for
20:   for each thread 0 <= t < |E| do   ▷ WL_e kernel
21:     Define e as the t-th net in E;
22:     Compute WL_e <- c_e^+ / b_e^+ - c_e^- / b_e^-;
23:   end for
24:   return reduce(\sum_{e \in E} WL_e), a^{\pm}, b^{\pm}, c^{\pm};
25: end function
26: function Backward(E, P, x, a^{\pm}, b^{\pm}, c^{\pm})
27:   for each thread 0 <= t < |P| do   ▷ \partial WL_e / \partial x_t kernel
28:     Define e as the net that pin t belongs to;
29:     Compute \partial WL_e / \partial x_t;
30:   end for
31:   return {\partial WL_e / \partial x_i}, i \in P;
32: end function

Fig. 4: Forward and backward dependency graph for (a) weighted-average wirelength and (b) density computation.

the algorithm, six kernels are needed. The x^{\pm} kernel requires atomic maximum and minimum operations, and the b^{\pm}, c^{\pm} kernels require atomic addition. At the end of the forward function, a summation reduction is needed to compute the overall wirelength cost, which is provided by the deep learning toolkit. In our implementation, multiple CUDA streams are adopted for independent computations, such as the x/y directions and the positive/negative components.

We observe that Algorithm 1 [30] has several drawbacks: expensive CUDA streams, sequential launches of many kernels, contention, and frequent global memory access. Among these drawbacks, frequent global memory access, especially frequent writes to the intermediate variables x^{\pm}, a^{\pm}, b^{\pm}, c^{\pm}, becomes the major runtime bottleneck. In other words, it is memory bound rather than computation bound. Thus, we revisit the natural net-by-net and pin-by-pin approaches. We discover that the net-by-net strategy has the potential to remove all the intermediate variables by merging the forward and backward functions, as shown in Algorithm 2. Instead of storing x^{\pm}, a^{\pm}, b^{\pm}, c^{\pm} in global memory, we only create local variables in the kernel function, and directly compute the wirelength for each net and the gradient for each pin. Although the variable a^{\pm} is computed twice, the store instructions only happen to the variables WL_e and \partial WL_e / \partial x_p, which significantly alleviates the memory pressure. The efficiency of the two algorithms is empirically compared in Section IV-B.
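The merged scheme of Algorithm 2 can be emulated in plain Python (an illustrative sketch with a made-up netlist, not the CUDA kernel): each per-net "thread" keeps x±, b±, c± local, recomputes a± rather than storing it, and writes only WL_e and the Equation (6) gradients. The analytic gradient is checked against central finite differences.

```python
import math

GAMMA = 0.5

def wa_forward_backward(net_pins, x, gamma=GAMMA):
    """net_pins: list of pin-index lists, one per net; x: pin coordinates.
    Returns (total WA wirelength, per-pin gradient)."""
    total, grad = 0.0, [0.0] * len(x)
    for pins in net_pins:                       # one "thread" per net
        hi = max(x[p] for p in pins)            # x_e^+ ... local to the kernel
        lo = min(x[p] for p in pins)
        bp = bn = cp = cn = 0.0
        for p in pins:                          # first pass: b, c
            ap = math.exp((x[p] - hi) / gamma)  # shifted exp, cannot overflow
            an = math.exp((lo - x[p]) / gamma)
            bp += ap; bn += an
            cp += x[p] * ap; cn += x[p] * an
        total += cp / bp - cn / bn              # WL_e, the only global write
        for p in pins:                          # second pass: Eq. (6) gradient
            ap = math.exp((x[p] - hi) / gamma)  # recompute a instead of storing
            an = math.exp((lo - x[p]) / gamma)
            grad[p] += ((1 + x[p] / gamma) * bp - cp / gamma) / bp**2 * ap \
                     - ((1 - x[p] / gamma) * bn + cn / gamma) / bn**2 * an
    return total, grad

# Check the analytic gradient against central finite differences.
nets = [[0, 1, 2], [1, 3]]
x = [0.0, 1.0, 2.5, 4.0]
wl, g = wa_forward_backward(nets, x)
h = 1e-6
for i in range(len(x)):
    xu = list(x); xu[i] += h
    xd = list(x); xd[i] -= h
    fd = (wa_forward_backward(nets, xu)[0]
          - wa_forward_backward(nets, xd)[0]) / (2 * h)
    assert abs(fd - g[i]) < 1e-5, (i, fd, g[i])
```

The max/min shift leaves both WL_e and the gradient unchanged, since the shared exponential factor cancels between the numerators and denominators.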
For the parallel CPU implementation, we adopt the net-by-net strategy and dynamic scheduling for heterogeneous net degrees. We observe that a chunk size of |E| / (#threads x 16) works well for most designs, where |E| is the number of nets in the design.

B. Density Forward and Backward

The forward and backward of the density cost is a computation-intensive procedure. Figure 4b plots the dependency graph for the density cost forward and backward. The computation consists of four steps: 1) compute the density map \rho; 2) compute a_{u,v}; 3) compute \psi in forward or \xi in backward; 4) compute D in forward or \partial D / \partial x_i in backward.

We model this computation flow as a dynamic bipartite graph forward and backward process, as shown in Figure 5. First, the density map calculation is modeled as a bipartite graph forward or a special 2D histogram problem where one cell may update multiple bins [31]. Then the electric potential and field are solved via DCT and other Fourier-related transforms. Finally, the electric force inflicted on each cell is collected from its overlapped bins, which can be modeled as a 2D gathering problem [31].

Fig. 5: Computation flow of (a) density map; (b) electric force.

1) Dynamic Bipartite Graph Forward for Density Map: Each step of the density map computation updates bins based on the overlapping areas of the corresponding cells. Thus, it can be modeled as a particular 2D histogram problem or a dynamic bipartite graph forward, as shown in Figure 5a. Each edge in the bipartite graph represents an update to the entry of the target bin in the density map, where the edge weight represents the overlapping area of the {cell, bin} pair. The reason why we call it "dynamic" is that, as cells move, the edges in the bipartite graph, which indicate overlaps between cells and bins, change accordingly.

A naive algorithm to parallelize this step is to allocate one GPU thread for each cell and use atomic addition to accumulate the overlapping areas with bins [30]. However, as a cell may cover multiple bins, simply using one GPU thread to update all overlapped bins sequentially causes a load imbalance problem due to the
Algorithm 2 Wirelength Forward and Backward Merged  
For parallel CPU implementation, we adopt the native atomic  
operations and dynamic scheduling for heterogeneous cell sizes. We  
Require: A set of nets E, a set of pins P, and pin locations x;  
Ensure: Wirelength cost and gradient;  
|
V |  
set the chunk size to #  
in the design.  
, where |V | is the number of cells  
threads×16  
1
2
3
4
5
6
7
8
: function Forward_Backward(E, P, x)  
WLe  
xp  
:
:
:
:
:
:
:
for each thread 0  t < |E| do  
. WLe,  
kernel  
2) Dynamic Bipartite Graph Backward for Electric Force: In the  
electric force computation, each cell receives the forces from the  
bins it overlaps with. Thus, the computation can be viewed as a 2D  
gathering problem or a dynamic bipartite graph backward, as shown  
in Figure 5b. Each edge represents the force from a bin, and the  
edge weight is the amount of the force. The weight is computed as  
the product of the overlapping area between the cell and the bin and  
the electric field at the bin.  
A natural strategy to accelerate this step is to allocate one  
thread for each cell and accumulate the forces sequentially from  
its overlapping bins [30]. However, considering this computation  
task shares a similar structure with the density map computation,  
we borrow the same idea from Section III-B1 by sorting the cells  
and allocating multiple threads for each cell.  
Define e as the net corresponds to thread t;  
+
±
e
xe  maxpe xp;  
. x are local in the kernel  
x
 minpe xp;  
e
±
±
±
. b , c are local in the kernel  
e
. WLe is in the global memory  
±
b
 0, c  0;  
e
e
e
WLe  0;  
for each pin p  e do  
±
xpx  
e
±
±
±
γ
9
:
a
b
c
e  
;
. a is local in the loop  
p
p
±
±
±
p
1
1
1
1
1
0:  
1:  
2:  
3:  
4:  
 b + a ;  
e
e
±
e
±
±
 ce + xp · a ;  
p
end for  
c+  
+
b
c
WLe b  
;
for each pin p  e do  
±
xpx  
e
±
±
±
γ
1
1
1
1
1
2
5:  
6:  
7:  
8:  
9:  
a
e  
;
. Compute ap again  
3) DCT/IDCT for Electric Potential and Field: The electric po-  
tential and field computation in Equation (5) requires fast DCT/IDCT  
kernels for efficient calculation. The standard DCT/IDCT for one-  
dimensional (1D) length-N sequence x is,  
p
WLe  
WLe  
is in the global memory  
∂xp  
Compute  
;
.
xp  
end for  
end for  
WLe  
xp  
return reduce({WLe}), {  }, p  P, e  E;  
N1  
X
π
1
0: end function  
DCT({xn})k =  
xn cos ( (n + )k),  
(7a)  
(7b)  
N
2
n=0  
N1  
X
1
π
1
IDCT({xn})k = x0 +  
xn cos ( n(k + )),  
2
1
1
0
0
.0  
.5  
.0  
.5  
.0  
2
N
2
n=1  
float64  
float32  
where k = 0, 1, ꢀ ꢀ ꢀ , N  1. We further derive IDXST as,  
N1  
X
π
1
IDXST({xn})k =  
xn sin ( n(k + )),  
(8a)  
(8b)  
(8c)  
N
2
n=0  
N1  
1
2
X
πn(k +  
)
),  
k
k
k
=
=
(1)  
xn(1) sin (  
N
n=0  
N1  
1
)
X
π(N  n)(k +  
2
),  
(1)  
(1)  
xn cos (  
Fig. 6: Comparison of different numbers of threads to update one  
cell in density forward and backward on bigblue4. The numbers  
are normalized by the runtime of 1 × 1 thread with float64.  
N
n=0  
N1  
X
π
1
k
k
=
=
xNn cos ( n(k + )),  
(8d)  
(8e)  
N
2
n=0  
(1) IDCT({xNn})k,  
variety in cell sizes. Empirically, the number of bins covered by a  
cell can vary from  10 to  1000. This ill-balanced workload  
within a thread warp introduces a big chunk of idle time and  
significantly degrades the performance. Therefore, we develop the  
following techniques to address this issue.  
Sort cells by area. We sort the standard cells by their areas, such  
that the 32 threads in a warp can process 32 consecutively-indexed  
cells with similar sizes. In this way, the cell-level workloads will be  
automatically balanced within a warp.  
Update one cell with multiple threads. We use multiple threads  
to update a single cell, which can effectively reduce the workload  
of each thread. Thus the issue of load imbalance can be further  
alleviated. An appropriate number of threads needs to be selected, given that this fine-grained parallelism inevitably introduces some runtime penalty: threads updating the same cell incur more computational redundancy and memory write contention from atomic operations. We experimentally evaluate different thread settings. Figure 6 shows the comparison on the bigblue4 benchmark. Based on these results, we empirically adopt 2 × 2 threads, i.e., 2 threads in each of the vertical and horizontal directions, which provides about 20–30% runtime improvement with both float32 and float64.
where x_N = 0. The equality between Equation (8d) and Equation (8e) can be derived by incorporating x_{N−n} into Equation (7b).
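The identity in Equation (8e) is easy to verify numerically against the direct sums of Equations (7b) and (8a); a minimal NumPy check:

```python
import numpy as np

N = 8
x = np.random.default_rng(0).normal(size=N)
n = np.arange(N)
k = np.arange(N).reshape(-1, 1)

# Eq. (8a): direct IDXST sum
idxst = (x * np.sin(np.pi / N * n * (k + 0.5))).sum(axis=1)

# Eq. (8e): (-1)^k IDCT({x_{N-n}})_k with x_N = 0, using the Eq. (7b) sum
y = np.concatenate(([0.0], x[:0:-1]))   # y_n = x_{N-n}, y_0 = x_N = 0
idct = 0.5 * y[0] + (y[1:] * np.cos(np.pi / N * n[1:] * (k + 0.5))).sum(axis=1)

assert np.allclose(idxst, (-1.0) ** np.arange(N) * idct)
```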
Given an M ×M density map ρ, the electric potential and field can  
be computed using DCT/IDCT, IDXST routines.  
a_{u,v} = DCT(DCT(ρ)^T)^T,   (9a)

ψ_{DCT} = IDCT(IDCT({ a_{u,v} / (w_u^2 + w_v^2) })^T)^T,   (9b)

ξ^X_{DSCT} = IDXST(IDCT({ a_{u,v} w_u / (w_u^2 + w_v^2) })^T)^T,   (9c)

ξ^Y_{DCST} = IDCT(IDXST({ a_{u,v} w_v / (w_u^2 + w_v^2) })^T)^T,   (9d)
where (·) denotes matrix transposition. The two-dimensional (2D)  
DCT/IDCT is computed by performing 1D DCT/IDCT to columns  
and then rows. All the computations can be broken down
into the 1D DCT/IDCT kernels with proper transformations. Thus,  
highly optimized DCT/IDCT kernels are critical to the performance.  
As the highly optimized fast Fourier transform (FFT) is provided  
by many deep learning toolkits, we leverage FFT to compute DCT.  
There are multiple ways to compute DCT using FFT with linear  
time additional processing. For example, TensorFlow adopts the  
implementation using 2N-point FFT. We choose the N-point FFT  
0278-0070 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: University of Exeter. Downloaded on June 27,2020 at 06:13:12 UTC from IEEE Xplore. Restrictions apply.  
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2020.3003843, IEEE  
Transactions on Computer-Aided Design of Integrated Circuits and Systems  
Algorithm 3 DCT/IDCT with N-Point FFT
Require: An even-length real sequence x;
Ensure: An even-length transformed real sequence y;
1: function DCT(x)
2:     N ← |x|;
3:     for each thread 0 ≤ t < N do              ▷ Reorder kernel
4:         if t < N/2 then
5:             x′_t ← x_{2t};
6:         else
7:             x′_t ← x_{2(N−t)−1};
8:         end if
9:     end for
10:    x′′ ← RFFT(x′);                           ▷ One-sided real FFT kernel
11:    for each thread 0 ≤ t < N do              ▷ e^{−jπt/(2N)} kernel
12:        if t ≤ N/2 then
13:            y_t ← 2ℜ(x′′_t e^{−jπt/(2N)});    ▷ get real part
14:        else
15:            y_t ← 2ℜ(x′′_{N−t} e^{jπt/(2N)}); ▷ get real part
16:        end if
17:    end for
18:    return y;
19: end function
20: function IDCT(x)
21:    N ← |x|;
22:    for each thread 0 ≤ t < N/2 + 1 do        ▷ Complex kernel; let x_N ← 0
23:        x′_t ← (x_t − j x_{N−t}) e^{jπt/(2N)};
24:    end for
25:    x′′ ← IRFFT(x′);                          ▷ One-sided real IFFT kernel
26:    for each thread 0 ≤ t < N do              ▷ Reverse kernel
27:        if t mod 2 == 0 then
28:            y_t ← (N/4) x′′_{t/2};
29:        else
30:            y_t ← (N/4) x′′_{N−(t+1)/2};
31:        end if
32:    end for
33:    return y;
34: end function

C. Density Weight Updating

We need to update the density weight λ in Equation (2) in each iteration to penalize the density cost. RePlAce [8] uses the following equations to update λ:

µ ← { µ_max, if p < 0; max(µ_min, µ_max^{1−p}), otherwise },   (18a)

λ ← λ · µ,   (18b)

where µ_min = 0.95, µ_max = 1.05, and p = ∆HPWL / (3.5 × 10^5). We follow almost the same scheme with one minor difference. When p < 0, we set µ ← µ_max · max(0.9999^k, 0.98) instead of µ_max, where k is the current iteration. This equation indicates that from iteration 0 to 200, µ gradually drops from 1.05 to 1.03 and keeps this value afterward, given the previous µ_max setting. We found that this minor change provides relatively stable convergence in our experiments.

D. Optimization Engine

ePlace/RePlAce [6], [8] uses Nesterov's method as the gradient-descent solver with a Lipschitz-constant approximation scheme for line search. We implement the same approach in Python leveraging the efficient API provided by the deep learning toolkit. The framework is compatible with other well-known solvers in deep learning toolkits, e.g., various momentum-based gradient descent algorithms like Adam [25] and RMSProp [33], providing additional solver options.

E. Legalization

We also develop legalization as an operator in DREAMPlace. It first follows the Tetris-like procedure similar to NTUplace3 [4]. Then it performs Abacus row-based legalization [34]. This step copies the cell locations from GPU to CPU and executes legalization purely on CPU because we observe that it only takes several seconds even for million-size designs with a single CPU thread.
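The modified density-weight schedule described in Section III-C above can be sketched in a few lines (a sketch; the function and constant names are ours):

```python
# Density-weight multiplier; p = dHPWL / 3.5e5 per Eq. (18)
MU_MIN, MU_MAX = 0.95, 1.05

def mu_update(p, k):
    # k is the current iteration; the p < 0 branch is the modified rule
    if p < 0:
        return MU_MAX * max(0.9999 ** k, 0.98)
    return max(MU_MIN, MU_MAX ** (1 - p))

# When p < 0, mu decays from 1.05 toward 1.05 * 0.98 = 1.029 (~1.03),
# matching the "drops from 1.05 to 1.03 by iteration 200" behavior.
```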
F. Extension to Consider Routability  
To optimize routing congestion, we adopt cell inflation to relieve congested regions [35]. We follow a similar scheme to RePlAce [8],
which invokes the NCTUgr global router [36] to get the routing  
overflow map during placement iterations. For each metal layer, we  
compute the ratio between routing demand and capacity at each  
routing tile. Then we use the maximum ratio across all layers to  
compute the inflation ratio for each tile.  
implementation [32] and demonstrate better efficiency in the exper-  
iments, as shown in Algorithm 3. Due to the symmetric property of  
FFT for real input sequences, we utilize one-sided real FFT/IFFT to  
save almost half of the sequence. With additional processing kernels  
like linear-time reordering and multiplication, DCT/IDCT can be  
computed with an N-point real FFT/IFFT.  
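The reorder–FFT–twiddle route can be illustrated in NumPy (a sketch of the idea only; we match the unscaled definition in Equation (7a) and omit the scaling used in the actual kernels):

```python
import numpy as np

def dct_ref(x):
    # Direct evaluation of Eq. (7a)
    N = len(x)
    n = np.arange(N)
    k = np.arange(N).reshape(-1, 1)
    return (x * np.cos(np.pi / N * (n + 0.5) * k)).sum(axis=1)

def dct_fft(x):
    # N-point FFT method: even/odd reorder, FFT, twiddle, take real part
    N = len(x)
    v = np.concatenate((x[0::2], x[1::2][::-1]))   # reorder kernel
    V = np.fft.fft(v)
    k = np.arange(N)
    return np.real(V * np.exp(-1j * np.pi * k / (2 * N)))
```

The key point is that a single N-point transform suffices, instead of padding to 2N points; the production kernels additionally use one-sided real FFTs to exploit the conjugate symmetry of real inputs.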
ratio = min( (max_{l∈L} demand_l / capacity_l)^{2.5}, 2.5 ),   (19)
where L is the set of metal layers. The exponent and maximum  
limits can be adjusted according to the benchmarks. We choose 2.5  
in the experiments. After that, we obtain an inflation ratio map. A  
cell will be inflated according to the inflation ratios of the tiles it  
overlaps with. If cells inflate too much, there may not be enough  
total whitespace to digest the area increment. Thus, we limit the  
area increment to be 10% of the total whitespace area in the layout  
every time. If the attempted area increment exceeds this ratio, we  
uniformly scale down the inflation ratio for each cell. During the  
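These two rules, the per-tile ratio of Equation (19) and the 10% whitespace cap, can be sketched as follows (array names and shapes are our assumptions, not DREAMPlace's API):

```python
import numpy as np

def inflation_ratios(demand, capacity):
    # Eq. (19): worst demand/capacity ratio across layers, exponent/cap 2.5
    util = (demand / capacity).max(axis=0)     # (nx, ny): worst layer per tile
    return np.minimum(util ** 2.5, 2.5)

def capped_inflation(cell_area, cell_ratio, total_whitespace):
    # Limit the attempted area increment to 10% of the total whitespace;
    # if exceeded, uniformly scale down the increment of every cell.
    increment = cell_area * (cell_ratio - 1.0)
    budget = 0.1 * total_whitespace
    if increment.sum() > budget:
        increment *= budget / increment.sum()
    return cell_area + increment
```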
placement iterations, once the cell overflow drops to 20%, we invoke  
the global router and perform inflation. The overflow will increase  
after inflation. Then, the solver is restarted to optimize wirelength  
and density again. We keep on looping until the total inflation ratio  
is less than 1% of the total cell area, or we reach a maximum of  
5 times of inflation. Starting from the first round of cell inflation,  
In the placement problem, we need to compute the 2D DCT/IDCT. A widely adopted algorithm, mentioned above, is to perform 1D DCT/IDCT through the rows and columns sequentially [30]. This row-column DCT algorithm is easy to implement but limited by its two-step procedure, redundant computation, and frequent memory transactions. To achieve better efficiency, we implement 2D
ory transaction. To achieve better efficiency, we implement 2D  
DCT/IDCT directly through 2D FFT, proven in [32]. Algorithm 4 il-  
lustrates the 2D DCT/IDCT implementation with 2D pre-processing  
and post-processing kernels. This implementation eliminates unnec-  
essary computations with a one-time call to 2D FFT kernels. The pre-  
and post-processing routines can be fully parallelized. This algorithm  
is adopted for both GPU and CPU implementations. We evaluate the  
efficiency of the DCT/IDCT transforms and the density operator in  
Section IV-B.  
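The one-shot 2D route can be sketched in NumPy: reorder both dimensions as in Equation (10), run one 2D FFT, and combine with twiddle factors. The post-processing below is our own derivation for the unscaled 2D analogue of Equation (7a), not the paper's exact kernel:

```python
import numpy as np

def dct2_direct(a):
    # Direct 2D DCT with the Eq. (7a)-style kernel in both dimensions
    N1, N2 = a.shape
    n1, n2 = np.arange(N1).reshape(-1, 1), np.arange(N2)
    out = np.empty_like(a)
    for k1 in range(N1):
        for k2 in range(N2):
            out[k1, k2] = (a * np.cos(np.pi / N1 * (n1 + 0.5) * k1)
                             * np.cos(np.pi / N2 * (n2 + 0.5) * k2)).sum()
    return out

def dct2_fft2(a):
    # Eq. (10)-style even/odd reorder in both dimensions, one 2D FFT,
    # then twiddle-based post-processing combining V(k1,k2) and V(N1-k1,k2)
    N1, N2 = a.shape
    v = np.concatenate((a[0::2], a[1::2][::-1]), axis=0)
    v = np.concatenate((v[:, 0::2], v[:, 1::2][:, ::-1]), axis=1)
    V = np.fft.fft2(v)
    k1, k2 = np.arange(N1).reshape(-1, 1), np.arange(N2)
    t1 = np.exp(-1j * np.pi * k1 / (2 * N1))
    t2 = np.exp(-1j * np.pi * k2 / (2 * N2))
    Vr = V[(N1 - np.arange(N1)) % N1, :]       # V(N1 - k1, k2)
    return 0.5 * np.real(t2 * (t1 * V + np.conj(t1) * Vr))
```

Only one FFT call is issued, and the pre-/post-processing is elementwise, which is what makes the 2D formulation friendly to massive parallelization.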
Algorithm 4 2D DCT, 2D IDCT, IDCT_IDXST, and IDXST_IDCT with N-Point 2D FFT
Require: A real N1 × N2 matrix x;   ▷ N1 and N2 can be any positive number
1: function 2D_DCT(x)
2:     x′ = 2d_dct_preprocess(x) using Equation (10),
           x′(n1, n2) = { x(2n1, 2n2),                      0 ≤ n1 ≤ (N1−1)/2,      0 ≤ n2 ≤ (N2−1)/2;
                          x(2N1 − 2n1 − 1, 2n2),            (N1+1)/2 ≤ n1 ≤ N1 − 1, 0 ≤ n2 ≤ (N2−1)/2;
                          x(2n1, 2N2 − 2n2 − 1),            0 ≤ n1 ≤ (N1−1)/2,      (N2+1)/2 ≤ n2 ≤ N2 − 1;
                          x(2N1 − 2n1 − 1, 2N2 − 2n2 − 1),  (N1+1)/2 ≤ n1 ≤ N1 − 1, (N2+1)/2 ≤ n2 ≤ N2 − 1; }   (10)
3:     x′′ = 2D_RFFT(x′);   ▷ 2D real FFT kernel
4:     return y = 2d_dct_postprocess(x′′) using Equation (11),
           y(n1, n2) = 2ℜ{ e^{−jπn2/(2N2)} ( e^{−jπn1/(2N1)} x′′(n1, n2) + e^{jπn1/(2N1)} x′′(N1 − n1, n2) ) },   (11)
       where x′′(N1, n2) = x′′(n1, N2) = 0, ∀n1, n2;
5: end function
6: function 2D_IDCT(x)
7:     x′ = 2d_idct_preprocess(x) using Equation (12),
           x′(n1, n2) = e^{jπn1/(2N1)} e^{jπn2/(2N2)} [ x(n1, n2) − x(N1 − n1, N2 − n2) − j ( x(N1 − n1, n2) + x(n1, N2 − n2) ) ],   (12)
       where x(N1, n2) = x(n1, N2) = 0, ∀n1, n2;
8:     x′′ = 2D_IRFFT(x′);   ▷ 2D real inverse FFT kernel
9:     return y = 2d_idct_postprocess(x′′) = 2d_dct_preprocess^{−1}(x′′) using Equation (13),
           y(n1, n2) = { x′′(n1/2, n2/2),                          n1 even, n2 even;
                         x′′(N1 − (n1+1)/2, n2/2),                 n1 odd,  n2 even;
                         x′′(n1/2, N2 − (n2+1)/2),                 n1 even, n2 odd;
                         x′′(N1 − (n1+1)/2, N2 − (n2+1)/2),        n1 odd,  n2 odd; }   (13)
10: end function
11: function IDCT_IDXST(x)
12:    x′ = idct_idxst_preprocess(x) using Equation (14),
           x′(n1, n2) = { x(n1, N2 − n2), n2 ≠ 0;  0, n2 = 0; }   (14)
13:    x′′ = 2D_IDCT(x′);
14:    return y = idct_idxst_postprocess(x′′) using Equation (15),
           y(n1, n2) = (−1)^{n2} x′′(n1, n2);   (15)
15: end function
16: function IDXST_IDCT(x)
17:    x′ = idxst_idct_preprocess(x) using Equation (16),
           x′(n1, n2) = { x(N1 − n1, n2), n1 ≠ 0;  0, n1 = 0; }   (16)
18:    x′′ = 2D_IDCT(x′);
19:    return y = idxst_idct_postprocess(x′′) using Equation (17),
           y(n1, n2) = (−1)^{n1} x′′(n1, n2);   (17)
20: end function
we slow down the density weight updating to make the gradient descent more stable. That is, we update the density weight λ every 5 iterations instead of every iteration.

G. Other Possible Extensions

The framework is general and can be extended to consider various advanced design objectives and constraints, e.g., timing and fence regions. Timing can be considered by net weighting or additional differentiable timing costs in the objective [29], [37]. Fence regions can be implemented by introducing multiple electric fields, e.g., one for each region, to enable independent spreading between regions.

IV. EXPERIMENTAL RESULTS

The framework was developed in Python with PyTorch for optimizers and API, and C++/CUDA for low-level operators. The CPU parallelism was implemented with OpenMP for wirelength and density operators. Both the DREAMPlace and the RePlAce [8]
programs run on a Linux server with a 40-core Intel E5-2698 v4 @ 2.20GHz and 1 NVIDIA Tesla V100 GPU based on the Volta architecture. ISPD 2005 contest benchmarks [38] and large industrial designs were adopted. We conducted experiments with both double-precision (float64) and single-precision (float32) floating point numbers on CPU and GPU. We use the same dimensions of bins as RePlAce.

B. Acceleration of Low-Level Operators

We further investigate the efficiency of the low-level operators, e.g., wirelength forward and backward, DCT/IDCT, and density forward and backward. Figure 10 compares three approaches discussed in Section III-A. “Net-by-Net” denotes the net-level parallelization;
“Atomic” denotes the pin-level parallelization with atomic operations  
in Algorithm 1 [30]; “Merged” denotes the combined forward and  
backward implementation in Algorithm 2. When using float32  
on GPU, the merged approach achieves 3.7× speedup over the net-  
by-net one and 1.8× speedup over the atomic one. On CPU, the  
atomic strategy is 20% slower than the net-by-net strategy with 40  
threads, while the merged strategy is over 30% faster. Meanwhile, a  
promising speedup factor of 7.5× from a single thread to 40 threads  
can be achieved with the net-by-net strategy.  
A. Placement Acceleration  
Table II and Table III show the HPWL and runtime details  
on ISPD 2005 and industrial benchmarks. With almost the same  
solution quality (within 0.3% difference on average), DREAMPlace  
running on GPU is able to achieve 38× and 47× speedup in  
GP on the two benchmark suites compared to RePlAce with 40  
threads. DREAMPlace running on CPU is also 2× faster than  
RePlAce with 40 threads in GP. RePlAce [8] crashed on the 10-  
million-cell industrial benchmark at the 6th iteration for Nesterov’s  
optimization. The potential reason is that the peak memory usage of  
RePlAce exceeded the maximum memory (64GB). Before crashing,  
it took 3396 seconds for initial placement and on average 7.5  
seconds for each Nesterov iteration. As this benchmark takes 1000  
iterations with DREAMPlace, we made a runtime estimation of  
Figure 11 compares the 2D DCT/IDCT implementation using 2N-  
point FFT (“DCT-2N” and “IDCT-2N”), N-point FFT (“DCT-N”  
and “IDCT-N”), and N-point 2D FFT (“DCT-2D-N” and “IDCT-2D-N”) [32]. Considering the map sizes in the experiment (from 512 × 512 to 4096 × 4096) with float32, the N-point DCT implementation is 2.1× faster [30] and the N-point 2D implementation
can be 5.0× faster. For IDCT, the N-point implementation achieves  
1.3× speedup and the 2D implementation achieves 4.1× speedup.
This result demonstrates the efficiency of Algorithm 4.  
3396 + 1000 × 7.5 ≈ 10896 seconds. Meanwhile, among all
As DCT/IDCT is used in the density operator, in Figure 12, the  
efficiency of the entire density forward and backward procedure is  
compared for GPU and CPU implementations. With all the speedup  
techniques, an average of 15  21× speedup on GPU can be  
achieved with the current implementation over the preliminary DAC  
version [30]. For the parallel CPU implementation, 3.1× runtime  
reduction can be achieved with 40 threads.  
RePlAce runs, initial placement takes 25–30% of the entire global placement time, and solving the nonlinear placement takes around 70–75%. The LG of DREAMPlace is also around 10× faster than the NTUplace3 legalizer in the RePlAce flow. As NTUplace3 performs the DP for both placers, the DP runtime is similar. The speedup for the entire placement flow on GPU is 4.6×, and that on CPU is 2.7×.
Figure 7 plots the GP runtime comparison between multi-threaded DREAMPlace and RePlAce with different precisions and implementations. It can be seen that the parallel CPU version of DREAMPlace is consistently faster than RePlAce. Meanwhile, this TCAD extension further improves the efficiency of the GPU implementations from the DAC version [30] except for the smallest benchmark adaptec1. Figure 8 plots the average runtime ratio for different cases. By switching from float64 to float32, an average speedup of 1.4× on CPU and 1.3× on GPU can be achieved, while the quality stays almost the same. Compared with the previous DAC version [30], this extension achieves 1.3× speedup with float64 and 1.8× speedup with float32. From Figure 8, we also observe that the speedup of CPU implementations saturates quickly from single thread to 40 threads. This observation holds for both RePlAce and DREAMPlace. For RePlAce, the best number of threads is 40 with a speedup of 3.2×, while for DREAMPlace, 20 threads provide the best efficiency with a factor of 5.0×.

C. Comparison with Solvers in PyTorch

As mentioned, DREAMPlace can enable easy adoption of native
solvers in PyTorch. Here we compare with the widely-used solvers  
implemented in the toolkit, like Adam [25] and stochastic gradient  
descent (SGD) with momentum, as shown in Table IV. As these  
solvers do not have line search, we add simple learning rate decay  
in each iteration to control the step size of gradient descent with  
the decay factor shown in the “LR Decay” columns. We use the  
default configurations for these solvers and report the final HPWL  
after detailed placement and the runtime for global placement in  
seconds. In our experiments, we find the gradient descent process  
may be unable to converge if the learning rate is not properly  
designed. Therefore, we customize the decay factor for each design.  
It can be seen that Adam can achieve slightly better results than Nesterov's accelerated gradient descent method (shortened to Nesterov's method for brevity) implemented in RePlAce, while Nesterov's method converges much faster. Meanwhile, the results for SGD with momentum are about 1.2% worse. As the solvers have many parameters to tune, it is hard to conclude from these experiments alone that Adam or SGD with momentum is definitely worse than Nesterov's method, but the preliminary results are at least promising enough to be worth further exploration. With the
DREAMPlace framework, we can investigate new solvers easily by  
scripting in PyTorch.  
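The per-iteration learning-rate decay we add for these solvers can be illustrated with a toy quadratic objective standing in for the placement objective (constants and names are ours, not the production configuration):

```python
import numpy as np

def decayed_gd(grad, x0, lr=0.5, decay=0.99, iters=200):
    # Plain gradient descent with per-iteration exponential learning-rate
    # decay, cf. the "LR Decay" columns in Table IV.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        x = x - lr * grad(x)
        lr *= decay                     # shrink the step size every iteration
    return x

# Minimize (x - 3)^2; the decayed step size still reaches the optimum
x_star = decayed_gd(lambda x: 2.0 * (x - 3.0), [0.0])
```

If the decay factor is too small, the step size collapses before convergence, which is why we tune it per design.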
Figure 9 draws the runtime breakdown of DREAMPlace on a 2-  
million-cell design bigblue4, where GP and LG only take 6.2%  
runtime of the entire flow. The runtime of GP and LG is even  
less than that of file IO for benchmark reading and writing. The  
majority of the runtime (82%) is taken by DP, which still relies  
on the external placer currently. Previous studies [39], [40] have  
demonstrated more than 6× speedup from GPU acceleration for DP  
over multi-threaded CPU. While DP is not the focus of this paper,  
there is a potential of 18× speedup for the entire placement by future incorporation of GPU-accelerated DP, e.g., 2400 / (25 + 9 + 332/6 + 45) ≈ 18 for bigblue4 according to Table II. On the other hand, within each forward and backward pass of GP, the density-related computation takes longer than wirelength (73.4% vs. 26.5%). With efficient DCT/IDCT implementation, the electric field computation is no longer the bottleneck for density forward and backward.

D. Routability-Driven Placement
To verify the runtime benefits in routability-driven placement, we  
conducted experiments on the DAC 2012 contest benchmarks [41].  
We consider two major metrics for solution quality: “sHPWL” as  
scaled wirelength and “RC” as routing congestion. In the contest,  
the RC is defined as a weighted average of overflows in the top  
TABLE II: Experimental results on ISPD 2005 benchmarks [38] with float64.  
RePlAce (40 threads)  
Runtime (s)  
DREAMPlace (40 threads)  
Runtime (s)  
DREAMPlace (V100)  
Runtime (s)  
Design  
#cells  
#nets  
HPWL  
HPWL  
HPWL  
GP  
LG  
DP  
Total  
GP  
LG  
DP  
IO  
Total  
GP  
LG  
DP  
IO  
Total  
adaptec1  
adaptec2  
adaptec3  
adaptec4  
bigblue1  
bigblue2  
bigblue3  
bigblue4  
211  
255  
452  
496  
278  
558  
1097  
2177  
221  
266  
467  
516  
284  
577  
1123  
2230  
73.22  
81.86  
193.34  
175.25  
89.87  
138.07  
305.09  
743.80  
80  
159  
297  
336  
130  
299  
787  
1789  
4
7
21  
27  
48  
55  
27  
82  
120  
299  
112  
201  
378  
426  
170  
419  
1030  
2400  
73.22  
82.23  
193.81  
173.85  
89.40  
136.73  
303.89  
743.69  
67  
98  
133  
187  
87  
143  
316  
655  
0.4  
0.5  
1
2
0.3  
9
24  
31  
57  
65  
32  
91  
142  
336  
4
5
9
10  
5
10  
21  
45  
96  
134  
201  
264  
125  
254  
484  
1047  
73.22  
82.22  
193.72  
174.08  
89.38  
136.54  
303.90  
743.75  
5
6
8
9
6
8
14  
25  
0.5  
0.5  
1
2
0.4  
9
25  
31  
57  
65  
31  
95  
142  
332  
4
5
9
9
6
10  
20  
45  
34  
42  
76  
85  
43  
123  
180  
413  
20  
20  
4
22  
41  
51  
3
9
3
9
ratio  
-
-
1.003  
38.2  
10.1  
0.9  
4.6  
1.000  
18.7  
0.9  
1.0  
1.0  
2.7  
1.000  
1.0  
1.0  
1.0  
1.0  
1.0  
TABLE III: Experimental results on industrial benchmarks with float64.  
RePlAce (40 threads)  
Runtime (s)  
DREAMPlace (40 threads)  
Runtime (s)  
DREAMPlace (V100)  
Runtime (s)  
DP  
Design  
#cells  
#nets  
HPWL  
HPWL  
HPWL  
GP  
LG  
DP  
Total  
GP  
LG  
DP  
IO  
Total  
GP  
LG  
IO  
Total  
design1  
design2  
design3  
design4  
design5  
design6  
1345  
1306  
2265  
1525  
1316  
10504  
1389  
1355  
2276  
1528  
1364  
10747  
340.76  
274.65  
524.36  
454.86  
287.46  
NA  
787  
793  
1369  
857  
39  
39  
74  
48  
38  
140  
134  
233  
166  
138  
NA  
1039  
1057  
1777  
1136  
1016  
NA  
340.64  
275.41  
522.68  
453.86  
287.14  
2360.94  
341  
363  
543  
384  
335  
3056  
4
4
14  
8
3
77  
173  
166  
299  
200  
167  
1650  
30  
30  
48  
33  
29  
549  
564  
906  
626  
535  
5037  
340.67  
275.36  
522.62  
453.83  
287.11  
2358.44  
17  
17  
27  
18  
17  
4
5
14  
8
4
76  
172  
167  
302  
202  
169  
30  
29  
48  
33  
31  
224  
218  
393  
262  
221  
2184  
776  
10896  
NA  
246  
181  
1666  
253  
ratio  
-
-
1.001  
47.3  
8.1  
0.8  
4.6  
1.000  
19.9  
1.0  
1.0  
1.0  
2.4  
1.000  
1.0  
1.0  
1.0  
1.0  
1.0  
[Fig. 7 legend: RePlAce float64 (40 threads); DREAMPlace float64 (40 threads), float32 (40 threads), DAC-float64 (V100), TCAD-float64 (V100), TCAD-float32 (V100); y-axis: GP runtime in seconds, log scale 10^0–10^5.]
Fig. 7: GP runtime comparison for ISPD2005 and industrial benchmarks between various implementations and precisions. The runtime of  
design6 for RePlAce for different numbers of threads is estimated with the method mentioned in the first paragraph of Section IV-A.
Fig. 9: Runtime breakdown in percentages of DREAMPlace with  
float32 on V100 (a) for bigblue4 and (b) one forward and  
backward pass in GP.  
following equation [41],  
Fig. 8: Average GPU runtime ratio for ISPD2005 and industrial  
benchmarks with different numbers of CPU threads. Normalized
by the runtime of the TCAD version of DREAMPlace on V100  
with float64, which is consistent with the ratios in Table II  
and Table III. The normalized ratios for 40 threads and GPUs are  
annotated for easier comparison.  
sHPWL = HPWL × (1 + 0.03 × (RC − 100)),   (20)

indicating that a unit increase in routing congestion is counted as 3% HPWL overhead.
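As a one-line sanity check of the scaling (function name is ours): RC = 101.5 inflates HPWL by 4.5%.

```python
def shpwl(hpwl, rc):
    # Eq. (20): 3% HPWL penalty per unit of routing congestion above 100
    return hpwl * (1.0 + 0.03 * (rc - 100.0))
```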
In this experiment, we obtained the RePlAce binary from the  
authors of [8] to keep consistent experimental settings. Table V  
shows the solution quality and runtime. As NCTUgr is repeatedly  
invoked as an external congestion estimator and it only runs on  
CPU with single-thread, we separate the runtime of GP into two  
parts: nonlinear optimization (“NL”) and global routing (“GR”).  
0.5%, 1%, 2%, 5% congested tiles. The minimum value for RC is 100, indicating no overflow. The sHPWL is computed using the
Fig. 12: Density forward and backward comparison. (a) GPU runtime  
comparison between DAC [30] and this extension. (b) CPU runtime  
comparison between single thread and 40 threads with float32.  
TABLE IV: Comparison with native PyTorch solvers like Adam [25] and SGD with momentum with float64 on GPU.
Design    | Nesterov [8]     | Adam                      | SGD Momentum
          | HPWL    GP (s)   | HPWL    GP (s)  LR Decay  | HPWL    GP (s)  LR Decay
adaptec1  | 73.22   5        | 73.02   8       0.995     | 73.84   8       0.993
adaptec2  | 82.22   6        | 82.44   7       0.995     | 83.72   9       0.993
adaptec3  | 193.72  8        | 191.22  12      0.995     | 198.07  12      0.993
adaptec4  | 174.08  9        | 172.84  13      0.995     | 175.77  14      0.993
bigblue1  | 89.38   6        | 89.89   10      0.995     | 89.64   9       0.993
bigblue2  | 136.54  8        | 136.43  14      0.995     | 137.48  14      0.993
bigblue3  | 303.90  14       | 302.95  33      0.997     | 312.79  24      0.995
bigblue4  | 743.75  25       | 740.60  66      0.997     | 744.55  57      0.995
ratio     | 1.000   1.000    | 0.997   1.781   -         | 1.012   1.687   -
Fig. 10: Wirelength forward and backward with float32. (a) GPU  
runtime comparison of different implementations. (b) CPU runtime  
comparison of different implementations with 40 threads. (c) CPU  
runtime comparison of the net-by-net strategy between single thread  
and 40 threads.  
V. CONCLUSION  
In this paper, we take a new perspective on solving classical  
analytical placement by casting it into a neural network training  
problem. Leveraging the deep learning toolkit PyTorch, we de-  
velop a new open-source placement engine, DREAMPlace with GPU  
acceleration. It achieves around 40× speedup in global placement  
without quality degradation for academic and industrial benchmarks,  
compared to the state-of-the-art RePlAce running on many threads.  
We explore different implementations of low-level operators for  
forward and backward propagation to boost the overall efficiency.  
Furthermore, DREAMPlace is highly extensible to incorporate new algorithms/solvers and new objectives by simply writing in high-level programming languages such as Python. We plan to further
investigate cell inflation for routability and net weighting for timing  
optimization [29], [35], [37] as well as GPU-accelerated detailed  
placement. It can also be extended to leverage multi-GPU platforms  
for further speedup. Meanwhile, we plan to investigate the efficiency  
of implementations using fixed point numbers to guarantee run-  
to-run determinism. As DREAMPlace decouples the high-level  
algorithmic design and low-level acceleration efforts, we believe  
this work shall open up new directions for revisiting classical EDA  
problems.  
Fig. 11: GPU runtime comparison of DCT/IDCT algorithms with  
float32.  
NTUplace3 [4] is adopted as the LG and DP for RePlAce, and  
DP for DREAMPlace. We can see that DREAMPlace with GPU  
acceleration can provide very similar solution quality, while 20×  
faster in NL and 9× faster in GP including the runtime of the global  
router. For the entire placement flow, we can achieve 5× speedup.  
DREAMPlace also shows compelling efficiency and quality with 40  
threads on CPU. We also observe that DREAMPlace invokes the  
global router less often than RePlAce, leading to shorter GR time.  
Meanwhile, for DREAMPlace, GR takes around 70% of the GP  
time, which is the runtime bottleneck.  
TABLE V: Experimental results on DAC 2012 benchmarks [41] for routability-driven placement.

Design  #nodes  #nets  | RePlAce               | DREAMPlace (40 threads) | DREAMPlace (RTX 2080TI)
                       | sHPWL   RC      Total | sHPWL   RC      Total   | sHPWL   RC      Total
SB2     1014K    991K  | 62.39   102.47  9382  | 61.06   101.57  5390    | 61.20   101.76  1746
SB3      920K    898K  | 30.69   100.81  3565  | 30.18   100.73  4040    | 30.18   100.65   835
SB6     1014K   1007K  | 31.30   100.61  2634  | 30.92   100.26  2414    | 31.00   100.33   694
SB7     1365K   1340K  | 37.20   101.13  2794  | 36.73   100.60  1395    | 36.73   100.61   509
SB9      847K    834K  | 21.48   101.09  2426  | 21.21   100.61   964    | 21.23   100.65   337
SB11     955K    936K  | 34.28   102.65  3385  | 32.86   100.86  1602    | 32.80   100.79   603
SB12    1293K   1293K  | 26.69   103.02  3898  | 26.90   101.25  3398    | 26.38   100.72   883
SB14     635K    620K  | 21.26   100.75  1052  | 21.25   100.55  1345    | 21.24   100.51   349
SB16     699K    697K  | 25.57   102.29  2331  | 25.42   101.77   891    | 25.53   101.94   283
SB19     523K    512K  | 14.21   101.05  1685  | 15.10   103.28    —     | 14.67   102.73   274
ratio                  | 1.010   1.005   5.4   | 1.004   1.001   3.4     | 1.000   1.000   1.0

Total runtime (s) covers nonlinear global placement (NL), global routing (GR), legalization (LG), and detailed placement (DP).
Results for both RePlAce and DREAMPlace are collected on a Linux machine with two 20-core Intel Xeon Gold 6230 CPUs (40 cores in total) and one NVIDIA RTX 2080TI GPU. We obtained the binary of RePlAce [8] to keep experimental settings consistent for this benchmark suite. As the RePlAce binary uses float32 for nonlinear placement, we use the same setting for DREAMPlace in this experiment. The binary also supports only a single thread, and the external global router NCTUgr is also single-threaded.
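The ratio row normalizes each column to the DREAMPlace GPU run. As a minimal sketch — assuming (this averaging scheme is our inference, not stated in the paper) that each ratio is the arithmetic mean of the per-design ratios — the reported 5.4× for RePlAce's total runtime can be reproduced from the table:

```python
# Hypothetical sketch: reproduce the "ratio" row of Table V, assuming it is
# the arithmetic mean of per-design ratios normalized to the GPU run.
replace_total = [9382, 3565, 2634, 2794, 2426, 3385, 3898, 1052, 2331, 1685]
gpu_total     = [1746,  835,  694,  509,  337,  603,  883,  349,  283,  274]

def mean_ratio(a, b):
    """Average of the element-wise ratios a[i] / b[i]."""
    ratios = [x / y for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)

print(round(mean_ratio(replace_total, gpu_total), 1))  # -> 5.4
```

The same computation on the sHPWL columns yields 1.010 for RePlAce, matching the quality ratio in the table.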
ACKNOWLEDGMENT

The authors would like to thank Mr. Lutong Wang and Mr. Ilgweon Kang from the University of California at San Diego for preparing the RePlAce binary, suggestions on the experimental setups, and verifying the results. This project is supported in part by NVIDIA.

REFERENCES

[1] A. B. Kahng, S. Reda, and Q. Wang, "Architecture and details of a high quality, large-scale analytical placer," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2005, pp. 891–898.
[2] T. Chan, J. Cong, and K. Sze, "Multilevel generalized force-directed method for circuit placement," in ACM International Symposium on Physical Design (ISPD). ACM, 2005, pp. 185–192.
[3] A. B. Kahng and Q. Wang, "A faster implementation of APlace," in ACM International Symposium on Physical Design (ISPD). ACM, 2006, pp. 218–220.
[4] T.-C. Chen, Z.-W. Jiang, T.-C. Hsu, H.-C. Chen, and Y.-W. Chang, "NTUplace3: An analytical placer for large-scale mixed-size designs with preplaced blocks and density constraints," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 27, no. 7, pp. 1228–1240, 2008.
[5] M.-K. Hsu, Y.-F. Chen, C.-C. Huang, S. Chou, T.-H. Lin, T.-C. Chen, and Y.-W. Chang, "NTUplace4h: A novel routability-driven placement algorithm for hierarchical mixed-size circuit designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 33, no. 12, pp. 1914–1927, 2014.
[6] J. Lu, P. Chen, C.-C. Chang, L. Sha, D. J.-H. Huang, C.-C. Teng, and C.-K. Cheng, "ePlace: Electrostatics-based placement using fast Fourier transform and Nesterov's method," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 20, no. 2, p. 17, 2015.
[7] J. Lu, H. Zhuang, P. Chen, H. Chang, C. Chang, Y. Wong, L. Sha, D. Huang, Y. Luo, C. Teng, and C. Cheng, "ePlace-MS: Electrostatics-based placement for mixed-size circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 34, no. 5, pp. 685–698, 2015.
[8] C. Cheng, A. B. Kahng, I. Kang, and L. Wang, "RePlAce: Advancing solution quality and routability validation in global placement," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 38, no. 9, pp. 1717–1730, 2019.
[9] Z. Zhu, J. Chen, Z. Peng, W. Zhu, and Y.-W. Chang, "Generalized augmented Lagrangian and its applications to VLSI global placement," in ACM/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1–6.
[10] N. Viswanathan, M. Pan, and C. Chu, "FastPlace 3.0: A fast multilevel quadratic placement algorithm with placement congestion control," in IEEE/ACM Asia and South Pacific Design Automation Conference (ASPDAC). IEEE, 2007, pp. 135–140.
[11] X. He, T. Huang, L. Xiao, H. Tian, and E. F. Y. Young, "Ripple: A robust and effective routability-driven placer," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 32, no. 10, pp. 1546–1556, 2013.
[12] T. Lin, C. Chu, J. R. Shinnerl, I. Bustany, and I. Nedelchev, "POLAR: placement based on novel rough legalization and refinement," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2013, pp. 357–362.
[13] T. Lin, C. Chu, J. R. Shinnerl, I. Bustany, and I. Nedelchev, "POLAR: A high performance mixed-size wirelength-driven placer with density constraints," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 34, no. 3, pp. 447–459, 2015.
[14] M.-C. Kim, D.-J. Lee, and I. L. Markov, "SimPL: An effective placement algorithm," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 31, no. 1, pp. 50–60, 2012.
[15] M.-C. Kim, N. Viswanathan, C. J. Alpert, I. L. Markov, and S. Ramji, "MAPLE: multilevel adaptive placement for mixed-size designs," in ACM International Symposium on Physical Design (ISPD). IEEE, 2012, pp. 193–200.
[16] T. Lin, C. Chu, and G. Wu, "POLAR 3.0: An ultrafast global placement engine," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2015, pp. 520–527.
[17] W. Li, Y. Lin, and D. Z. Pan, "elfPlace: Electrostatics-based placement for large-scale heterogeneous FPGAs," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD). Westminster, CO: IEEE Press, November 2019.
[18] "Cadence SOC Encounter," http://www.cadence.com.
[19] "Synopsys IC Compiler," http://www.synopsys.com.
[20] A. Ludwin, V. Betz, and K. Padalia, "High-quality, deterministic parallel placement for FPGAs on commodity hardware," in ACM Symposium on FPGAs. ACM, 2008, pp. 14–23.
[21] W. Li, M. Li, J. Wang, and D. Z. Pan, "UTPlaceF 3.0: A parallelization framework for modern FPGA global placement," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2017, pp. 922–928.
[22] J. Cong and Y. Zou, "Parallel multi-level analytical global placement on graphics processing units," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD). ACM, 2009, pp. 681–688.
[23] C.-X. Lin and M. D. Wong, "Accelerate analytical placement with GPU: A generic approach," in IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE). IEEE, 2018, pp. 1345–1350.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Conference on Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2019, pp. 8024–8035.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[26] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. Cambridge, MA: MIT Press, 2016, vol. 1.
[27] M.-K. Hsu, Y.-W. Chang, and V. Balabanov, "TSV-aware analytical placement for 3D IC designs," in ACM/IEEE Design Automation Conference (DAC). ACM, 2011, pp. 664–669.
[28] M.-K. Hsu, V. Balabanov, and Y.-W. Chang, "TSV-aware analytical placement for 3-D IC designs based on a novel weighted-average wirelength model," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 32, no. 4, pp. 497–509, 2013.
[29] W. C. Naylor, R. Donelly, and L. Sha, "Non-linear optimization system and method for wire length and delay optimization for an automatic electric circuit placer," Oct. 9, 2001, US Patent 6,301,693.
[30] Y. Lin, S. Dhar, W. Li, H. Ren, B. Khailany, and D. Z. Pan, "DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement," in ACM/IEEE Design Automation Conference (DAC). ACM, 2019, pp. 1–6.
[31] K. A. Berman and J. Paul, Fundamentals of Sequential and Parallel Algorithms, 1st ed. Boston, MA, USA: PWS Publishing Co., 1996.
[32] J. Makhoul, "A fast cosine transform in one and two dimensions," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 1, pp. 27–34, 1980.
[33] F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu, "A sufficient condition for convergences of Adam and RMSProp," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 11127–11135.
[34] P. Spindler, U. Schlichtmann, and F. M. Johannes, "Abacus: fast legalization of standard cell circuits with minimal movement," in ACM International Symposium on Physical Design (ISPD). ACM, 2008, pp. 47–53.
[35] T. F. Chan, K. Sze, J. R. Shinnerl, and M. Xie, "mPL6: Enhanced multilevel mixed-size placement with congestion control," in Modern Circuit Placement. Springer, 2007.
[36] W.-H. Liu, W.-C. Kao, Y.-L. Li, and K.-Y. Chao, "NCTU-GR 2.0: multithreaded collision-aware global routing with bounded-length maze routing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 32, no. 5, pp. 709–722, 2013.
[37] A. B. Kahng and Q. Wang, "An analytic placer for mixed-size placement and timing-driven placement," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2004, pp. 565–572.
[38] G.-J. Nam, C. J. Alpert, P. Villarrubia, B. Winter, and M. Yildiz, "The ISPD2005 placement contest and benchmark suite," in ACM International Symposium on Physical Design (ISPD). ACM, 2005, pp. 216–220.
[39] S. Dhar and D. Z. Pan, "GDP: GPU accelerated detailed placement," in IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2018, pp. 1–7.
[40] Y. Lin, W. Li, J. Gu, H. Ren, B. Khailany, and D. Z. Pan, "ABCDPlace: Accelerated batch-based concurrent detailed placement on multithreaded CPUs and GPUs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2020.
[41] N. Viswanathan, C. Alpert, C. Sze, Z. Li, and Y. Wei, "The DAC 2012 routability-driven placement contest and benchmark suite," in ACM/IEEE Design Automation Conference (DAC). ACM, 2012, pp. 774–782.

Yibo Lin (S'16–M'19) received the B.S. degree in microelectronics from Shanghai Jiaotong University in 2013, and his Ph.D. degree from the Electrical and Computer Engineering Department of the University of Texas at Austin in 2018. He is currently an assistant professor in the Computer Science Department associated with the Center for Energy-Efficient Computing and Applications at Peking University, China. His research interests include physical design, machine learning applications, GPU acceleration, and hardware security. He has received 4 Best Paper Awards at premier venues (ISPD 2020, DAC 2019, VLSI Integration 2018, and SPIE 2016). He has also served in the Technical Program Committees of many major conferences, including ICCAD, ICCD, ISPD, and DAC.

Zixuan Jiang received the B.E. degree in electronic information engineering from Zhejiang University, Hangzhou, China in 2018. He is currently pursuing his Ph.D. degree in the Department of Electrical and Computer Engineering of the University of Texas at Austin. His current research interests involve machine learning frameworks and applications, and physical design algorithms and implementations.

Jiaqi Gu received the B.E. degree in Microelectronic Science and Engineering from Fudan University, Shanghai, China in 2018. He is currently a post-graduate student studying for his Ph.D. degree in the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, under the supervision of Prof. David Z. Pan. His current research interests include machine learning, algorithm and architecture design, optical neuromorphic computing for AI acceleration, and GPU acceleration for VLSI physical design automation. He has received the Best Paper Award at ASP-DAC 2020.

Wuxi Li (S'18–M'19) received the B.S. degree in microelectronics from Shanghai Jiao Tong University, Shanghai, China, in 2013, and the M.S. and Ph.D. degrees in computer engineering from the University of Texas at Austin, Austin, TX, in 2015 and 2019, respectively. He is currently a Staff Software Engineer in the Vivado Implementation Team at Xilinx, San Jose, CA, where he primarily works on physical synthesis. Dr. Li has received the Best Paper Award at DAC 2019, the Silver Medal in the ACM Student Research Contest at ICCAD 2018, and the 1st-place awards in the FPGA placement contests of ISPD 2016 and 2017.

Shounak Dhar obtained his B.Tech. in Electrical Engineering from IIT Bombay in 2014 and his Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin in 2019. His research interests include electronic design automation and hardware acceleration. He is currently working at Intel Corporation on EDA algorithms in Intel's FPGA design implementation tool.

Haoxing Ren (M'00–SM'09) received his B.S./M.S. degrees in Electrical Engineering from Shanghai Jiao Tong University, his M.S. degree in Computer Engineering from Rensselaer Polytechnic Institute, and his Ph.D. degree in Computer Engineering from the University of Texas at Austin. From 2000 to 2006, he was a software engineer with IBM Microelectronics. From 2007 to 2015, he was a Research Staff Member with the IBM T. J. Watson Research Center. From 2015 to 2016, he was a technical executive with SuZhou PowerCore Technology. He is currently a Principal Research Scientist at NVIDIA. His research interests are machine learning applications in design automation and GPU-accelerated EDA. He received many IBM technical achievement awards, including the IBM Corporate Award for his work on improving microprocessor design productivity. He holds over twenty patents and has co-authored more than 40 papers, including several book chapters in physical design and logic synthesis. He has received Best Paper Awards at ISPD'13 and DAC'19.

Brucek Khailany (M'00–SM'13) received the Ph.D. degree from Stanford University, Stanford, CA, in 2003 and the B.S.E. degree from the University of Michigan, Ann Arbor, MI, in 1997, both in electrical engineering. He joined NVIDIA in 2009 and is currently the Director of the ASIC and VLSI Research group. He leads research into innovative design methodologies for integrated circuit (IC) development, machine learning (ML) and GPU-assisted electronic design automation (EDA) algorithms, and energy-efficient ML accelerators. Over 10 years at NVIDIA, he has contributed to many projects in research and product groups spanning computer architecture and VLSI design. Previously, from 2004 to 2009, he was a Co-Founder and Principal Architect at Stream Processors, Inc. (SPI), where he led research and development activities related to parallel processor architectures.
David Z. Pan (S'97–M'00–SM'06–F'14) received his B.S. degree from Peking University, and his M.S. and Ph.D. degrees from the University of California, Los Angeles (UCLA). From 2000 to 2003, he was a Research Staff Member with the IBM T. J. Watson Research Center. He is currently Engineering Foundation Professor at the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA. His research interests include cross-layer nanometer IC design for manufacturability, reliability, security, machine learning and hardware acceleration, and design/CAD for analog/mixed-signal designs and emerging technologies. He has published over 375 journal articles and refereed conference papers, and is the holder of 8 U.S. patents. He has served as a Senior Associate Editor for ACM Transactions on Design Automation of Electronic Systems (TODAES), and an Associate Editor for IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Transactions on Very Large Scale Integration Systems (TVLSI), IEEE Transactions on Circuits and Systems Part I (TCAS-I), IEEE Transactions on Circuits and Systems Part II (TCAS-II), IEEE Design & Test, Science China Information Sciences, Journal of Computer Science and Technology, IEEE CAS Society Newsletter, etc. He has served in the Executive and Program Committees of many major conferences, including DAC, ICCAD, ASPDAC, and ISPD. He is the ASPDAC 2017 Program Chair, ICCAD 2019 General Chair, DAC 2014 Tutorial Chair, and ISPD 2008 General Chair.
He has received a number of prestigious awards for his research contributions, including the SRC Technical Excellence Award in 2013, DAC Top 10 Author in Fifth Decade, DAC Prolific Author Award, ASP-DAC Frequently Cited Author Award, 19 Best Paper Awards at premier venues (ISPD 2020, ASPDAC 2020, DAC 2019, GLSVLSI 2018, VLSI Integration 2018, HOST 2017, SPIE 2016, ISPD 2014, ICCAD 2013, ASPDAC 2012, ISPD 2011, IBM Research 2010 Pat Goldberg Memorial Best Paper Award, ASPDAC 2010, DATE 2009, ICICDT 2009, SRC Techcon in 1998, 2007, 2012 and 2015) and 15 additional Best Paper Award finalists, Communications of the ACM Research Highlights (2014), UT Austin RAISE Faculty Excellence Award (2014), Cadence Academic Collaboration Award (2019), and many international CAD contest awards, among others. He is a Fellow of IEEE and SPIE.